Correlation and Regression

Lauren Talluto

15.01.2024

More data manipulation

Today we will look at three useful functions for manipulating data frames in more complicated/interesting cases.

trees = read.csv("../datasets/tree_abundance.csv")
head(trees)
##   pid year Abies.balsamifera Acer.saccharum Betula.papyrifera
## 1   1 1980                 4              0                 0
## 2   2 2006                40              0                 1
## 3   3 2006                36              0                 9
## 4   4 1980                 3              0                 0
## 5   5 1980                 4              0                 0
## 6   6 1980                16              0                 0
##   Populus.tremuloides Tsuga.canadensis
## 1                   0                0
## 2                   0                0
## 3                   0                0
## 4                   0                0
## 5                   0                0
## 6                   0                0


More data manipulation: tall vs. wide data

  1. From wide => tall: melt()
  2. From tall => wide: dcast()
trees = read.csv("../datasets/tree_abundance.csv")

# install.packages("reshape2") # run this once, to install the package
library("reshape2")
trees_tall = melt(trees, id.vars = c("pid", "year"), variable.name = "species", value.name = "abundance")
head(trees_tall)
##   pid year           species abundance
## 1   1 1980 Abies.balsamifera         4
## 2   2 2006 Abies.balsamifera        40
## 3   3 2006 Abies.balsamifera        36
## 4   4 1980 Abies.balsamifera         3
## 5   5 1980 Abies.balsamifera         4
## 6   6 1980 Abies.balsamifera        16
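If you prefer to avoid an extra package, the same wide-to-tall idea can be sketched with base R's reshape(). This is a toy illustration on a hypothetical two-column mini data frame, not the lecture dataset:

```r
# Toy sketch (hypothetical mini data frame, not the lecture dataset):
# base R's reshape() can do the same wide-to-tall conversion as melt().
wide = data.frame(pid = 1:2, sp_A = c(4, 40), sp_B = c(0, 1))
tall = reshape(wide, direction = "long",
               varying = c("sp_A", "sp_B"), v.names = "abundance",
               timevar = "species", times = c("sp_A", "sp_B"),
               idvar = "pid")
tall  # one row per pid-species combination
```

melt() is usually more convenient for many species columns, but reshape() needs no installation.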


More data manipulation: tall vs. wide data

trees_wide = dcast(trees_tall, pid ~ species + year, value.var = "abundance", fill = 0)
trees_wide[1:10, 1:10]
##    pid Abies.balsamifera_1975 Abies.balsamifera_1980 Abies.balsamifera_1985
## 1    1                      0                      4                      0
## 2    2                      0                      0                      0
## 3    3                      0                      0                      0
## 4    4                      0                      3                      0
## 5    5                      0                      4                      0
## 6    6                      0                     16                      0
## 7    7                      0                      0                      0
## 8    8                      0                      4                      0
## 9    9                      0                     30                      0
## 10  10                      0                      0                      0
##    Abies.balsamifera_1988 Abies.balsamifera_1989 Abies.balsamifera_1990
## 1                       0                      0                      0
## 2                       0                      0                      0
## 3                       0                      0                      0
## 4                       0                      0                      0
## 5                       0                      0                      0
## 6                       0                      0                      0
## 7                       0                      0                      0
## 8                       0                      0                      0
## 9                       0                      0                      0
## 10                      0                      0                      0
##    Abies.balsamifera_1993 Abies.balsamifera_1994 Abies.balsamifera_1995
## 1                       0                      0                      0
## 2                       0                      0                      0
## 3                       0                      0                      0
## 4                       0                      0                      0
## 5                       0                      0                      0
## 6                       0                      0                      0
## 7                       0                      0                      0
## 8                       0                      0                      0
## 9                       0                      0                      0
## 10                      0                      0                      0
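For simple cases, the reverse (tall => wide) can also be sketched in base R: xtabs() cross-tabulates a value column by two id columns, much as dcast() does above. Toy data again, not the lecture dataset:

```r
# Toy sketch of tall => wide with base R: xtabs() sums a value column
# across the levels of two id columns (analogous to dcast() above).
tall = data.frame(pid = c(1, 1, 2, 2),
                  species = c("A", "B", "A", "B"),
                  abundance = c(4, 0, 40, 1))
wide = xtabs(abundance ~ pid + species, data = tall)
wide  # rows = pid, columns = species; absent combinations become 0
```

dcast() remains the better tool when you need combined column names such as species_year.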

Taking subsets

balsam_fir_1980 = subset(trees_tall, species == "Abies.balsamifera" & year == 1980)
head(balsam_fir_1980)
##   pid year           species abundance
## 1   1 1980 Abies.balsamifera         4
## 4   4 1980 Abies.balsamifera         3
## 5   5 1980 Abies.balsamifera         4
## 6   6 1980 Abies.balsamifera        16
## 7   7 1980 Abies.balsamifera         0
## 8   8 1980 Abies.balsamifera         4
hist(balsam_fir_1980$abundance)

Boxplots on tall data

par(mar = c(8, 3, 0.1, 0.1)) # adjust the margins
bpl = boxplot(abundance ~ species + year, data = trees_tall, outline = FALSE, xlab = "",
              xaxt = "n")
# do a bit of magic to rotate the axis labels
axis(side = 1, at = 1:length(bpl$names), labels = bpl$names, cex.axis = 0.6, las = 2)

Aggregation

head(aggregate(abundance ~ species + year, data = trees_tall, FUN = mean))
##               species year abundance
## 1   Abies.balsamifera 1975      56.7
## 2      Acer.saccharum 1975       0.0
## 3   Betula.papyrifera 1975       0.0
## 4 Populus.tremuloides 1975      10.3
## 5    Tsuga.canadensis 1975       0.0
## 6   Abies.balsamifera 1980      21.8
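As an aside, base R's tapply() computes the same group summaries as aggregate(), returned as a named vector rather than a data frame. A minimal sketch on toy data:

```r
# Toy data: tapply() gives the same group means as aggregate(),
# but as a named vector instead of a data frame.
d = data.frame(g = c("a", "a", "b"), y = c(1, 3, 10))
tapply(d$y, d$g, mean)                  # a = 2, b = 10
aggregate(y ~ g, data = d, FUN = mean)  # same result as a data frame
```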

Covariance

Two variables are independent when knowing \(x\) tells us nothing about \(y\):

\[ pr(y|x) = pr(y) \]

We can use the plot function to make a bivariate scatterplot to visualise this:

x = rnorm(1000)
y = rnorm(1000) # these variables are independent

plot(x, y, pch=16, bty='n')
Independent random variables

Covariance

\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]

# in R (x1, y1 ... x3, y3 are the variable pairs from the three figure panels):
c(  top =    cov(x1, y1),
    middle = cov(x2, y2),
    bottom = cov(x3, y3))
##      top   middle   bottom 
##  0.80012 -0.41042 -0.00761
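The formula above can be checked by hand against cov(). A small sketch on simulated data (the panel variables x1...y3 themselves are not reproduced here):

```r
# Checking the covariance formula by hand against cov() (simulated data).
set.seed(1)
x = rnorm(50)
y = x + rnorm(50)  # positively related variables
n = length(x)
cov_manual = sum((x - mean(x)) * (y - mean(y))) / (n - 1)
all.equal(cov_manual, cov(x, y))  # TRUE
```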

Correlation

\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]

# in R:
c(  top =    cor(x1, y1),
    middle = cor(x2, y2),
    bottom = cor(x3, y3))
##      top   middle   bottom 
##  0.89570 -0.39786 -0.00738
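Correlation is just covariance standardised by the two standard deviations, which is easy to verify on simulated data:

```r
# r = cov(x, y) / (s_x * s_y): a quick check on simulated data.
set.seed(2)
x = rnorm(50)
y = x + rnorm(50)
r_manual = cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # TRUE
```

This standardisation is why \(r\) is unitless and bounded between -1 and 1, while the covariance is not.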

Correlation significance testing

\(H_0\): \(r = 0\)

\(H_A\): two sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))

\(r\) has a standard error:

\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

We can then compute a \(t\)-statistic with \(n-2\) degrees of freedom:

\[ t = \frac{r}{s_r} \]

The p-value is the probability, under \(H_0\), of observing a \(t\)-statistic at least this extreme (computed from the CDF of the \(t\) distribution); we reject \(H_0\) when it falls below \(\alpha\).

Correlation test in R

\[ \begin{aligned} s_{r} &= \sqrt{\frac{1-r^2}{n-2}} \\ t &= \frac{r}{s_r} \end{aligned} \]

data(penguins, package = "palmerpenguins")
penguins = as.data.frame(penguins)
penguins = penguins[complete.cases(penguins),] # remove NAs

penguins_ade = subset(penguins, species == "Adelie")
n = nrow(penguins_ade)
(r = cor(penguins_ade$body_mass_g, penguins_ade$flipper_length_mm))
## [1] 0.465
(s_r = sqrt((1-r^2)/(n-2)))
## [1] 0.0738
(t_val = r/s_r)
## [1] 6.3
(2 * pt(abs(t_val), n-2, lower.tail = FALSE)) ## two-sided test: 2 * P(T > |t|)
## [1] 3.4e-09

Visualisation

The classic way to visualise is a scatterplot

plot(flipper_length_mm ~ body_mass_g, data = penguins_ade, pch=16, 
     xlab = "Adelie Penguin Body Mass (g)", ylab = "Flipper Length (mm)", bty='n')

Correlation test in R

\[ \begin{aligned} s_{r} &= \sqrt{\frac{1-r^2}{n-2}} \\ t &= \frac{r}{s_r} \end{aligned} \]

# Equivalent built-in test
with(penguins_ade, cor.test(body_mass_g, flipper_length_mm, alternative = "two.sided"))
## 
##  Pearson's product-moment correlation
## 
## data:  body_mass_g and flipper_length_mm
## t = 6, df = 144, p-value = 3e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.327 0.583
## sample estimates:
##   cor 
## 0.465

Correlation test: assumptions

Correlation pitfalls

Correlation pitfalls

Heterogeneity of subgroups

Spearman correlation

cor.test(x, y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = -1, df = 148, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2447  0.0734
## sample estimates:
##     cor 
## -0.0879

cor.test(x, y, method = 'spearman')
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 8e+05, p-value = 9e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## -0.357
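Spearman's \(\rho\) is simply Pearson's \(r\) computed on the ranks of the data, which is why it tolerates nonlinear but monotone relationships. A sketch with simulated tie-free data:

```r
# Spearman's rho equals Pearson's r on the ranks (exact when there are
# no ties); simulated monotone-but-nonlinear data.
set.seed(3)
x = rexp(40)                    # skewed predictor
y = x^3 + rnorm(40, sd = 0.01)  # monotone, strongly nonlinear response
rho = cor(x, y, method = "spearman")
all.equal(rho, cor(rank(x), rank(y)))  # TRUE
```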

Tests of association: 2 × 1 Tables

Mendel’s peas

Observation:

colour F2
violet 705
white 224


\(H_0\): Inheritance is Mendelian (violet:white = 3:1)

\(H_A\): Inheritance is not exactly Mendelian

# Test of proportions against a null hypothesis
# For small sample sizes, use binom.test
counts = matrix(c(705, 224), ncol = 2)
prop.test(counts, p = 0.75, alternative = "two.sided")
## 
##  1-sample proportions test with continuity correction
## 
## data:  counts, null probability 0.75
## X-squared = 0.3, df = 1, p-value = 0.6
## alternative hypothesis: true p is not equal to 0.75
## 95 percent confidence interval:
##  0.730 0.786
## sample estimates:
##     p 
## 0.759
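The comment in the code above mentions binom.test(); it runs the exact version of the same test and reaches the same conclusion on Mendel's counts:

```r
# Exact binomial test of the 3:1 hypothesis (705 violet of 929 total).
bt = binom.test(705, 705 + 224, p = 0.75)
bt$p.value  # well above 0.05: no evidence against the 3:1 ratio
```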

Tests of association: n × n Tables

Nesting holes of black-backed woodpeckers.

woodpecker = read.csv("../datasets/woodpecker.csv")
head(woodpecker)
##   forest_type nest_tree
## 1      burned     birch
## 2      burned     birch
## 3      burned jack pine
## 4      burned     aspen
## 5      burned     birch
## 6      burned jack pine

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

We want to test for an association between the two variables (forest type and nest tree)

\(H_0\): Nesting tree is not associated with forest type

\(H_A\): Nest tree is associated with forest type

Chi-squared test

Nesting holes of black-backed woodpeckers.

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

table(woodpecker)/rowSums(table(woodpecker))
##            nest_tree
## forest_type  aspen  birch jack pine
##    burned   0.2500 0.6667    0.0833
##    unburned 0.4143 0.2571    0.3286


# for 2x2 tables with small sample sizes: Fisher's exact test
# fisher.test()
with(woodpecker, chisq.test(forest_type, nest_tree))
## 
##  Pearson's Chi-squared test
## 
## data:  forest_type and nest_tree
## X-squared = 14, df = 2, p-value = 0.001
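The same test can be run on the contingency table directly, which also exposes the expected counts behind the statistic. A sketch rebuilding the table by hand:

```r
# Rebuilding the woodpecker table by hand; chisq.test() accepts a matrix
# and returns the expected counts under independence.
tab = matrix(c(6, 29, 16, 18, 2, 23), nrow = 2,
             dimnames = list(forest_type = c("burned", "unburned"),
                             nest_tree = c("aspen", "birch", "jack pine")))
res = chisq.test(tab)
round(res$expected, 1)  # expected counts under H0 (independence)
res$statistic           # X-squared, about 13.6 on 2 df
```

Comparing observed to expected counts shows where the association comes from: burned forests have far more birch nests than expected.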

Visualisation: Categorical Data

You see this a lot. When should you do it? NEVER

Visualisation: Categorical Data

This is almost as bad, and sadly much more common

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

library(ggplot2)
woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type)) + theme_minimal()
woodp_plot = woodp_plot + geom_bar(width = 0.5)
woodp_plot

Stacked bars are “unfair”: only the class rooted on the axis (here unburned) is easy to compare across bars.

Visualisation: Categorical Data

Barplots, or proportional bars for counts within categories

table(woodpecker)
##            nest_tree
## forest_type aspen birch jack pine
##    burned       6    16         2
##    unburned    29    18        23

woodp_plot = ggplot(woodpecker, aes(x = nest_tree,
                fill = forest_type))
woodp_plot = woodp_plot + geom_bar(width = 0.5, 
                    position=position_dodge())
woodp_plot = woodp_plot + xlab("Nest Tree Type") +
    theme_minimal() + labs(fill = "Forest Type")
woodp_plot

Side-by-side bars allow us to compare all categories on equal footing.

Visualisation: Ordinal Data

Scatterplots become less useful.

birddiv = read.csv("../datasets/birddiv.csv")
bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_point() + theme_minimal()
head(birddiv)
##   Grow.degd For.cover  NDVI bird_type richness forest_frag
## 1       330      99.9 60.38    forest        8           1
## 2       330       0.0 22.88    forest        1           0
## 3       330      38.3 11.86    forest        5           3
## 4       330      11.4 19.07    forest        7           7
## 5       330       0.0  2.12    forest        2           0
## 6       170     100.0 54.03    forest        7           1

Visualisation: Ordinal Data

Adding jitter can sometimes improve things

bird_plot = ggplot(birddiv, aes(x=forest_frag, 
                y = richness, colour = bird_type)) + 
                geom_jitter() + theme_minimal()

Visualisation: Ordinal Data

Better solution: boxplots

bird_plot = ggplot(birddiv, aes(x=as.factor(forest_frag), 
                y = richness, fill = bird_type)) + 
                geom_boxplot() + theme_minimal()
bird_plot = bird_plot + xlab("Forest Fragmentation")

The linear model

\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \end{aligned} \]

where \(a\) and \(b\) are the sample estimates of the population parameters \(\alpha\) and \(\beta\).

\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:

\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

Method of least squares

\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

We want to solve these equations for \(a\) and \(b\)

The “best” \(a\) and \(b\) are the ones that draw a line that is closest to the most points (i.e., that minimizes \(s_{\epsilon}\))

Method of least squares

\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]

We estimate \(a\) and \(b\) by minimising the error sum of squares (ESS):

\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]

Ordinary least squares estimation

Solving for the minimum ESS yields:

\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]

The parameters have standard errors:

\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i-\bar{x})^2}} \]

\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum (x_i-\bar{x})^2}} \]
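These closed-form solutions can be checked against lm() on toy data (simulated here, not the sparrow data used later):

```r
# Checking the OLS closed forms against lm() on simulated toy data.
set.seed(4)
x = 1:20
y = 2 + 0.5 * x + rnorm(20, sd = 0.4)
b = cov(x, y) / var(x)     # slope: cov_xy / s_x^2
a = mean(y) - b * mean(x)  # intercept
mod = lm(y ~ x)
coef(mod)                  # matches c(a, b)
# standard error of the slope: s_yhat / sqrt(sum((x - xbar)^2))
s_b = summary(mod)$sigma / sqrt(sum((x - mean(x))^2))
s_b                        # matches summary(mod)$coefficients["x", "Std. Error"]
```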

Significance tests

\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))

The ratio of variance explained to residual variance follows an \(F\) distribution

\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]

Goodness of fit

The coefficient of determination tells you the proportion of variance explained by the model:

\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
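For simple linear regression, \(r^2\) really is the squared Pearson correlation between \(x\) and \(y\), which is easy to verify on simulated data:

```r
# r-squared from lm() equals cor(x, y)^2 in simple linear regression.
set.seed(5)
x = rnorm(30)
y = 1 + 2 * x + rnorm(30)
mod = lm(y ~ x)
all.equal(summary(mod)$r.squared, cor(x, y)^2)  # TRUE
```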

Regression assumptions

Transformations

Regression in R

## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17), 
        wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 
                        3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
## 
## Call:
## lm(formula = wing_length ~ age, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3070 -0.2154  0.0655  0.1632  0.2251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7131     0.1479    4.82  0.00053 ***
## age           0.2702     0.0135   20.03  5.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.218 on 11 degrees of freedom
## Multiple R-squared:  0.973,  Adjusted R-squared:  0.971 
## F-statistic:  401 on 1 and 11 DF,  p-value: 5.27e-10
confint(mod)
##             2.5 % 97.5 %
## (Intercept) 0.388   1.04
## age         0.241   0.30

Regression in R: diagnostics

par(mfrow=c(1, 2), bty='n')

## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)


## also try
## plot(mod)

Regression in R: diagnostics

## also try
par(mfrow = c(2,2), bty='n', mar = c(3, 3, 2,0), mgp = c(2, 0.5, 0))
plot(mod)

Presenting regression output

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401\), \(p < 0.001\), \(R^2 = 0.97\); Table 1).

Table 1. Parameter estimates for regression of wing length on age
            Estimate   St. error   95% CI
Intercept   0.71       0.15        [0.39, 1.03]
Age         0.27       0.013       [0.24, 0.30]